Abstract



This paper reports on an approach to the classification of protein
primary structures that takes advantage of the Self Organizing Maps
algorithm and of a numerical coding of the amino acids based upon their
physico-chemical properties.

Hydrophobicity, volume, surface area, hydrophilicity, bulkiness,
refractivity and polarity were subjected to a Principal Component
Analysis and the first two principal components, explaining 84.8% of the
total observed variability, were used to cluster the amino acids into 4
or 5 classes through a k-means algorithm. This leads to an economical
representation of the primary structures which, in the construction of
the input vectors for the Self Organizing Maps algorithm, allows the
consideration of frequency matrices of up to tri- and tetrapeptides with
minimal computational overhead.

In comparison with previously explored conditions, namely symbolic coding
of amino acids and dipeptide frequencies, no significant improvement was
observed in the classification of 69 cytochromes of the c type,
characterized by a high degree of structural and functional similarity,
while a substantial improvement occurred in the case of a data set
including quite heterogeneous primary structures.



1 Introduction 



Coding the primary structure of proteins by lists of numbers related to
the physico-chemical properties of the amino acids (AAs) in the
polypeptide chains should provide substantial help in the study of the
correlations between primary and tridimensional structures (Eisenhaber et
al. 1995; Rost and Sander 1993; Reyes et al. 1994), and hopefully shed
some light on the intricacies of the rules governing protein folding
(Fedorov and Baldwin 1997).

Although the issue has long been discussed in the literature (Argos 1987;
Schneider and Wrede 1993), the vast majority of the software tools devoted 
to the analysis of the primary structure (Thompson et al. 1994; Wishart 
et al. 1994) utilize the symbolic coding of AAs, the main reason being the 
successful drawing of phylogenetic trees on the basis of homologous proteins 
of different species after proper alignment (Page 1996). 

Numerical coding of amino acid residues on solid physico-chemical and
statistical grounds, however, makes it possible to take advantage of a
wide range of numerical multivariate data-analysis techniques and, in
particular, to fully exploit the heuristic power of automatic
classification based upon Self Organizing Maps (SOMs), introduced by
Kohonen (1984) as a general purpose tool for classifying the elements of
a multivariate set. The only strict requirement of their unsupervised
learning mechanism, i.e. the same number of variables for each element of
the set, can be easily met even if the primary structures to be
classified are of different length. Any protein, in fact, can be
associated with a frequency matrix of n^d elements, where each element is
the number of occurrences of one of the possible oligopeptides of length
d within the primary structure (n = 20 in the case of the natural AAs).
On the basis of this approach, assuming a different symbol for each of
the 20 natural amino acids and d = 2, i.e. generating frequency matrices
of 20^2 elements, it was possible to carry out both fine classifications
within sets of structurally similar proteins (Ferran and Ferrara 1991,
1992), and coarser classifications over much larger sets (Ferran et al.
1992).
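
As an illustration of the coding just described, the following Python
sketch builds the vector of n^d oligopeptide counts for a single
sequence. The function name oligopeptide_counts, the ordering of the
alphabet and the toy sequence are assumptions made for this example only;
they are not taken from the software used in this work.

```python
from itertools import product

# One-letter codes of the 20 natural amino acids (the ordering is arbitrary).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def oligopeptide_counts(sequence, d, alphabet=AMINO_ACIDS):
    """Return the n**d occurrence counts of all oligopeptides of length d."""
    # Map every possible oligopeptide to a fixed position in the vector.
    index = {"".join(p): i for i, p in enumerate(product(alphabet, repeat=d))}
    vector = [0] * len(index)
    # A sequence of length N contributes N - d + 1 overlapping oligopeptides.
    for start in range(len(sequence) - d + 1):
        vector[index[sequence[start:start + d]]] += 1
    return vector


if __name__ == "__main__":
    counts = oligopeptide_counts("ACACA", d=2)
    print(len(counts), sum(counts))  # prints: 400 4
```

Whatever the length of the sequence, the vector has the same number of
components, which is the only requirement imposed by the SOM algorithm.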

If, on the one hand, increasing the length d of the oligopeptide accounts
with higher and higher precision for the fine details of each individual
primary structure, on the other hand the exponential increase in the
number of possible d-plets in the n^d matrix poses some practical and
theoretical limitations. The former obviously refer to the computational
load, while the latter are related to the number of oligopeptides
(N - d + 1), linearly decreasing with d, with which a sequence of length
N can contribute to the non-zero, i.e. significant, elements of the
frequency matrix.

In this paper two exemplary cases of protein primary structure
classification will be described, in which an appropriate balance between
the n and d values in the frequency matrices feeding the SOM algorithm
makes it possible to: i) use oligopeptides longer than dipeptides as
descriptors of the primary structures, and ii) minimize the ensuing
computational load by lowering the size of n with no (or minimal) loss of
the statistically significant information, through the combined use of
principal component and cluster analysis techniques.



2 Methods 

2.1 Self Organizing Maps (SOM) 

The SOM algorithm, proposed by Teuvo Kohonen in the early 1980s (Kohonen
1984), is a fully automatic algorithm that drastically reduces the
dimensionality of a highly multivariate data set while preserving the
mutual correlations between its elements. The most recent implementation
of this algorithm (SOMPAK 3.1, free software available, together with a
rich bibliography, at the Web site http://www.cis.hut.fi/nnrc/) has been
used throughout the present paper.

In our case the input of the algorithm is a set of numerical vectors
obtained by an appropriate recoding of the primary structures of
proteins, and the output is a bidimensional map where the mutual
locations of the primary structures reflect their intrinsic similarities.
An extensive and clear description of the inner workings of the algorithm
is available in the literature (Kohonen 1995), where an estimate of the
distortion introduced in the original structure of the data set by
reducing its dimensionality is given in the form of a stress factor. As a
more specific index of the goodness of the classification obtained in the
case of proteins, the Map Mean Homology (MMH) index (see below) has been
used throughout this paper.
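
To make the working of the algorithm more concrete, a minimal Python
sketch of a SOM with a "bubble" neighbourhood is given below. It is not
the SOMPAK implementation: the rectangular lattice, the linear decay of
the learning rate and of the radius, the default parameter values and the
function names are all simplifying assumptions made for illustration.

```python
import math
import random


def best_matching_unit(codebook, x):
    """Index of the map cell whose codebook vector is closest to x."""
    return min(range(len(codebook)), key=lambda i: math.dist(codebook[i], x))


def train_som(data, rows, cols, epochs=40, lr0=0.3, radius0=1.5, seed=0):
    """Train a small SOM and return its rows * cols codebook vectors."""
    rng = random.Random(seed)
    # Codebook vectors are initialised with randomly picked input vectors.
    codebook = [list(rng.choice(data)) for _ in range(rows * cols)]
    # Grid coordinates of the cells (rectangular lattice: a simplification).
    grid = [(r, c) for r in range(rows) for c in range(cols)]
    total_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for x in rng.sample(data, len(data)):  # one pass in random order
            remaining = 1.0 - step / total_steps
            lr = lr0 * remaining          # learning rate decays linearly
            radius = radius0 * remaining  # so does the neighbourhood radius
            bmu = best_matching_unit(codebook, x)
            for cell, position in enumerate(grid):
                # Bubble neighbourhood: every cell within the radius moves
                # towards x by the same fraction; the others are untouched.
                if math.dist(position, grid[bmu]) <= radius:
                    codebook[cell] = [
                        w + lr * (xi - w) for w, xi in zip(codebook[cell], x)
                    ]
            step += 1
    return codebook


if __name__ == "__main__":
    group_a = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.0, 0.1, 0.0), (0.0, 0.0, 0.1)]
    group_b = [tuple(v + 5.0 for v in p) for p in group_a]
    som = train_som(group_a + group_b, rows=4, cols=4)
    print(best_matching_unit(som, group_a[0]), best_matching_unit(som, group_b[0]))
```

In the toy usage, the two well separated groups of vectors end up in
different cells of the map, which is the behaviour exploited here to
place similar primary structures close to each other.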



2.2 Calculation of the MMH (Map Mean Homology) index 

To evaluate the goodness of the clustering provided by the SOM algorithm,
the Map Mean Homology (MMH) index has been used, along the same lines
followed by Ferran and Ferrara (1991). This index can be defined as

$$ \mathrm{MMH} = \frac{1}{n} \sum_{i=1}^{n} QR_i^{Clusters} \qquad (1) $$

i.e. the average of the Quality Ratio values ($QR_i^{Clusters}$)
associated with the n clusters present on the map. A cluster is defined
by the presence in a cell of at least two elements, and is extended to
its first neighbours, counted only once. Thus, the $QR_i^{Clusters}$ for
the i-th cluster is defined as

$$ QR_i^{Clusters} = \frac{\sum_{j=1}^{m} W_j \, QR_{ij}^{Couples}}{\sum_{j=1}^{m} W_j} \qquad (2) $$

where j runs over the m couples associated with the i-th cluster and with
its first neighbours, weighted by $W_j$ values of 1 and 0.5 in the former
and latter case, respectively.
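
The two definitions above can be sketched in a few lines of Python. The
data layout (each cluster given as a list of couples, each couple carrying
its Quality Ratio and a flag telling whether it belongs to the cluster
cell or involves a first neighbour) and the function names are
assumptions of this example; how the couples are enumerated on a real map
is not addressed here.

```python
def cluster_quality_ratio(couples):
    """Weighted mean of the couple Quality Ratios of one cluster.

    `couples` is a list of (quality_ratio, in_cluster_cell) pairs: couples
    inside the cluster cell get weight 1, couples involving its first
    neighbours get weight 0.5.
    """
    weights = [1.0 if in_cell else 0.5 for _, in_cell in couples]
    weighted_sum = sum(w * qr for w, (qr, _) in zip(weights, couples))
    return weighted_sum / sum(weights)


def map_mean_homology(clusters):
    """Average of the cluster Quality Ratios over all clusters of the map."""
    ratios = [cluster_quality_ratio(couples) for couples in clusters]
    return sum(ratios) / len(ratios)


if __name__ == "__main__":
    # Two hypothetical clusters with made-up Quality Ratio values.
    clusters = [[(0.8, True), (0.4, False)], [(0.5, True)]]
    print(map_mean_homology(clusters))
```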

2.3 Principal Component Analysis of AAs' physico-chemical 
properties. 

The Principal Component Analysis (PCA), introduced by Pearson in 1901, 
is a method of decomposing a correlation or covariance matrix in order to 
find the best association of points in space (Jolliffe 1986). 

The first goal of the principal components is to summarize a multivariate 
data set as accurately as possible using fewer uncorrelated variables. This 
can be achieved since the principal components are orthogonal to each other, 
thus removing any redundancy in the available information. The relation 
between the original variables and the principal components is expressed in 
terms of component loadings, i.e. the correlation coefficients of the original 
variables with the new ones (principal components). 

In this paper PCA has been carried out over seven physico-chemical
properties of the 20 natural AAs, namely hydrophobicity, volume, surface
area, hydrophilicity, bulkiness, refractivity index and polarity which,
according to Schneider and Wrede (1993), are relevant in the
identification of specific patterns along protein sequences. Among these
properties, hydrophobicity has recently been confirmed as by far the most
important one in protein folding (Weiss and Herzel 1998). In Table 1 our
PCA results are reported in terms of the component loadings and of the
percent variability explained by each component. The first and second
components (PC1, PC2) explain 84.8% of the total variability and hence
have been considered as reliable and non-redundant representatives of the
whole set of properties.
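
A PCA of this kind can be sketched with NumPy as follows. The sketch
decomposes the correlation matrix, which is one of the two options
mentioned above, and is run on random placeholder numbers, not on the
property values of Schneider and Wrede (1993); the function name
pca_loadings and its n_keep parameter are likewise assumptions made for
this example.

```python
import numpy as np


def pca_loadings(X, n_keep=2):
    """PCA of the correlation matrix of X (rows = units, columns = variables).

    Returns the fraction of variability explained by each component, the
    loadings (correlations between variables and components) and the scores
    of the units on the first `n_keep` components.
    """
    X = np.asarray(X, dtype=float)
    corr = np.corrcoef(X, rowvar=False)
    eigval, eigvec = np.linalg.eigh(corr)      # eigenvalues in ascending order
    order = np.argsort(eigval)[::-1]           # re-sort in descending order
    eigval = np.clip(eigval[order], 0.0, None)
    eigvec = eigvec[:, order]
    explained = eigval / eigval.sum()
    loadings = eigvec * np.sqrt(eigval)        # one column per component
    Z = (X - X.mean(axis=0)) / X.std(axis=0)   # standardised variables
    scores = Z @ eigvec[:, :n_keep]
    return explained, loadings, scores


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    placeholder = rng.normal(size=(20, 7))     # 20 units, 7 made-up variables
    explained, loadings, scores = pca_loadings(placeholder)
    print(np.round(100 * explained, 2))
```

With this normalization the squared loadings of each variable sum to one
over the seven components, and the squared loadings of each component sum
to seven times its explained fraction, which gives a quick consistency
check for a table of loadings.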

2.4 k-means clusterization of AAs. 

The k-means algorithm is a semi-automatic procedure to identify classes 
within a given set of elements described by one or many variables (Everitt 
1980). Clusters emerge here from the structural characteristics of the data 
set, by maximizing the interclass variance and minimizing the intraclass vari- 
ance. For n units described by m variables, the procedure can be schema- 
tized as follows: 

1. a non-trivial number of classes, k, is defined, with 1 < k < n;

2. k aggregation points in an m-dimensional space are arbitrarily chosen;

3. each of the n units is assigned to the nearest aggregation point;

4. a new set of aggregation points is computed as the barycentres of the
classes defined in the previous step;

5. steps 3 and 4 are repeated until no further change occurs in the
composition of the classes.

The external factor which makes the procedure not fully automatic is the
a priori definition of k.
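
The five steps listed above translate almost literally into the following
Python sketch. Choosing the initial aggregation points among the units
themselves, the seed parameter and the cap on the number of iterations
are assumptions of this example.

```python
import math
import random


def k_means(points, k, seed=0, max_iter=100):
    """Cluster `points` (a list of equal-length tuples) into k classes."""
    rng = random.Random(seed)
    # Step 2: k aggregation points are arbitrarily chosen among the units.
    centres = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Step 3: assign each unit to the nearest aggregation point.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centres[c])) for p in points
        ]
        # Step 5: stop when the composition of the classes no longer changes.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 4: recompute the aggregation points as class barycentres.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty class keeps its previous aggregation point
                centres[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centres


if __name__ == "__main__":
    toy = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    labels, centres = k_means(toy, k=2)
    print(labels)
```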

In the present case, the algorithm has been used to group into k classes 
the 20 AAs on the basis of their hydrophobicity (m = 1), as well as the 
values of the first and second principal components (m = 2) extracted from
their main physico-chemical properties. 

The relative optimality of the k value can be assessed by means of the
relation between the fraction of explained variability (EV) of the
classification and the value of k: reaching a plateau of EV with
increasing k can be considered the mark of a structurally optimal
classification (see also the legend to Table 4).
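
As a sketch of how the explained variability of a given classification
can be computed, the Python function below takes the ratio of the
between-class variability to the sum of the between-class and
within-class variabilities. Using sums of squared deviations, with each
class weighted by its size, is an assumption of this example; the exact
weighting adopted for the published EV values is not specified beyond the
legend to the clustering table.

```python
def explained_variability(points, labels):
    """Fraction of the total variability explained by a classification."""
    dims = len(points[0])
    grand = [sum(p[i] for p in points) / len(points) for i in range(dims)]
    between = within = 0.0
    for lab in set(labels):
        members = [p for p, l in zip(points, labels) if l == lab]
        centre = [sum(p[i] for p in members) / len(members) for i in range(dims)]
        # Variability between the class barycentre and the grand mean.
        between += len(members) * sum((c - g) ** 2 for c, g in zip(centre, grand))
        # Variability of the class members around their own barycentre.
        within += sum((x - c) ** 2 for p in members for x, c in zip(p, centre))
    return between / (between + within)


if __name__ == "__main__":
    values = [(0.0,), (2.0,), (10.0,), (12.0,)]  # made-up one-dimensional data
    print(explained_variability(values, [0, 0, 1, 1]))
```

Evaluating this function for increasing k and looking for the value at
which it stops growing appreciably is one way to apply the plateau
criterion described above.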

3 Results 

3.1 Data sets used in this paper 

The guiding criterion in the choice of the two data sets used in this
work reflects the aim of testing the performance of a numerical coding of
the AAs, and of a variable length of the oligopeptides describing the
primary structures, under two different conditions, namely a high and a
low value of a global similarity index (see below).

For Data Set I, shown in Table 2, 69 cytochromes of the c type were
chosen, which are known to share a high level of both structural and
functional similarity. To impose some rational constraint on the choice
of Data Set II, where a high similarity in the primary structures was not
a prerequisite, our attention focused on a group of proteins in which, as
shown by Alexandrov and Fisher (1996), a significant similarity in the
tridimensional arrangements was not paralleled by any homology in the
primary structures. The elements of Data Set II are listed in Table 3.

It is worth stressing that the two data sets should be considered from 
two complementary viewpoints: 

i) since the differences between the elements in Data Set I consist of a
number of gaps/point-mutations over essentially the same basic primary
structure, any source of variability (information) related to structural
and/or functional features is expected to be minimal. Under these
conditions even the simplest symbolic coding, blind to physico-chemical
features, can be appropriate;

ii) the high heterogeneity among the elements of Data Set II, related to
their quite different length, composition, function and primary
structure, should favour any classification task based on a numerical
coding of the sequences. This introduces, however, new problems about
choosing the optimal physico-chemical descriptors of the AAs, or about
how to group them into clusters, on which the goodness of the
classification heavily depends.

A quantitative estimate of the Set Mean Homology (SMH) among the n
elements (taken in couples) within a set is given by

$$ \mathrm{SMH} = \frac{\sum_{i=1}^{n-1} \sum_{j=i+1}^{n} QR_{ij}}{n(n-1)/2} \qquad (3) $$

where the $QR_{ij}$ are the QualityRatio values, i.e. correspond to the
elements of a triangular matrix generated as an intermediate result by
the PILEUP program in the GCG suite of programs for the analysis of
biosequences (Doelz 1994). More precisely, each $QR_{ij}$ is given by an
estimate of the goodness of the alignment between the elements i and j of
the data set as provided by: i) the Needleman-Wunsch algorithm (1969),
and ii) a substitution matrix of the BLOSUM type (Henikoff and Henikoff
1992), normalized by the number of residues of the shorter of the two
sequences i and j. Notice that the procedure used in reckoning $QR_{ij}$
refers to a symbolic coding of the natural AAs, i.e. matches the
condition used as a reference (black bars) in Figure 2. However,
high-quality classifications of primary structures can also be obtained
upon clustering the AAs into 4 or 5 groups through a k-means algorithm,
after an appropriate numerical coding provided by PCA.
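
Given a matrix of pairwise QualityRatio values, the SMH is simply the
mean over all distinct couples, as in the Python sketch below. The
function name and the small matrix of made-up values are assumptions of
this example; in this work the values come from the PILEUP program.

```python
def set_mean_homology(qr):
    """Mean of qr[i][j] over all couples i < j of a square score matrix."""
    n = len(qr)
    total = sum(qr[i][j] for i in range(n - 1) for j in range(i + 1, n))
    return total / (n * (n - 1) / 2)


if __name__ == "__main__":
    # Hypothetical QualityRatio values for three sequences.
    qr = [
        [0.0, 0.2, 0.4],
        [0.2, 0.0, 0.6],
        [0.4, 0.6, 0.0],
    ]
    print(set_mean_homology(qr))
```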



3.2 Classification of the data sets' elements. 

Figure 1 shows the map generated by the SOM algorithm in the case of Data
Set I. This data set, due to the high level of similarity between the
primary structures, constitutes a significant benchmark to test the fine
discrimination power of the algorithm. A very similar data set has been
successfully analyzed by Ferran and Ferrara (1992) using a symbolic
coding of the 20 natural AAs and dipeptide frequencies, i.e. a vector of
20^2 components for each primary structure. Unlike these authors, we used
a numeric coding for the AAs with the aim of: i) exploiting the
physico-chemical information characterizing each single residue; ii)
increasing the length of the oligopeptides; iii) minimizing the
computational burden by reducing the number of classes into which the
residues can be clustered. The main goal was to provide a more direct
correlation between the similarities of primary and tertiary structures.
A glance at Figure 1 indicates that even when the primary structures are
described by vectors of 5^3 components, corresponding to tripeptide
frequencies and to a clustering of the AAs into five groups, the
phylogenetic relationships within cytochromes are very well preserved.
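
The construction of such a 125-component vector can be sketched in Python
as follows. The five classes are those listed in the PC1 + PC2 column of
the clustering table for five clusters; the class labels 0-4, the function
names and the toy sequence are assumptions made for this example.

```python
from itertools import product

# Five classes of amino acids (PC1 + PC2 clustering into five groups).
CLASSES = ["CILMPTV", "NDQEH", "AGS", "FWY", "RK"]
AA_TO_CLASS = {aa: str(i) for i, group in enumerate(CLASSES) for aa in group}


def recode(sequence):
    """Rewrite a sequence over the reduced five-symbol alphabet."""
    return "".join(AA_TO_CLASS[aa] for aa in sequence)


def tripeptide_vector(sequence):
    """Return the 5**3 = 125 tripeptide counts of the recoded sequence."""
    reduced = recode(sequence)
    keys = ["".join(p) for p in product("01234", repeat=3)]
    counts = dict.fromkeys(keys, 0)
    for start in range(len(reduced) - 2):
        counts[reduced[start:start + 3]] += 1
    return [counts[key] for key in keys]


if __name__ == "__main__":
    print(recode("ACDFK"))                  # prints: 20134
    print(len(tripeptide_vector("ACDFKGW")))  # prints: 125
```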

A quantitative estimate of the classification goodness obtained by the
SOM algorithm is provided in Figure 2 in terms of the Map Mean Homology
(MMH, see Methods) score for both Data Sets I and II. Each panel of
Figure 2 also shows, as a dotted line, the Set Mean Homology (SMH, see
Section 3.1), i.e. an estimate of the overall similarity between all the
couples of elements in the set. Under all conditions the height of the
bars exceeds the dotted line by an amount indicating the performance of
the classifier algorithm. The bars in Figure 2 represent the values of
the MMH for various combinations of: i) the coding criteria for the AAs;
ii) the number of groups into which the AAs are clustered; iii) the
length of the oligopeptides whose frequencies constitute the vectors
associated with each sequence.

The most interesting result provided by our analysis is the striking
difference between the two data sets in the efficiency of the adopted
coding scheme for the AAs. Taking as a reference the previously used
symbolic coding coupled to dipeptide frequencies (black bars in Figure
2), substantially identical results have been obtained under all
conditions when the data set included elements of high SMH (Figure 2A).
Upon relaxing the latter constraint, however, a numerical coding based
upon a PCA of the main physico-chemical properties of the AAs (Table 1),
followed by the clustering of the AAs (Table 4) into 4 or 5 groups,
provided a worse performance in the case of dipeptide frequencies and a
better one in the case of tripeptide frequencies (Figure 2B).

To rationalize these results two basic points should be taken into
account. First of all, it is quite obvious that, in very general terms,
the ability of the SOM algorithm to find peaks of similarity in the map
is enhanced over a background of globally low similarity. Such an effect
is independent of the coding criteria of the residues and only deals with
the specific features of the elements to be classified. It can be
described by the expression

$$ \frac{\langle \mathrm{MMH} \rangle - \mathrm{SMH}}{\mathrm{SMH}} \qquad (4) $$

which, for the data shown in Figure 2A and B, gives the average values of
0.19 ± 0.03 and 1.83 ± 0.97, respectively.

Second, the much higher relative variance associated with the results in
Figure 2B clearly indicates that the role of the coding criteria, namely
i) oligopeptide length, and ii) optimized (through PCA) physico-chemical
information, emerges only in the case of Data Set II.

Finally, the difference observed between the two data sets when the
classification occurs after a random clustering of the AAs into 4, 5 or
10 groups (white columns in Figure 2) deserves special consideration.
Such a condition has been included in our analysis to clarify the
relative importance of the symbolic coding of AAs (see Discussion).



4 Discussion 

In classifying proteins of different length on the basis of their
polypeptide sequences, a crucial problem consists in the appropriate
coding of the AAs, since the relevant statistical and connectionist
procedures usually require as an input numerical vectors of identical
dimension. To overcome the problem a "units-variables" matrix may be
worked out, where the rows are associated with the proteins and the
columns contain, for example, the relative frequencies of the 20 natural
AAs, or of dipeptides, tripeptides, etc., thus providing a more and more
accurate (although longer) global description of the primary structures.
In particular, such an approach has been applied in the use of a neural
classifying algorithm, the SOM (see Methods), endowed with an automatic
feature extraction ability in the absence of any independent information
(unsupervised learning), with a minimum number of adjustable parameters.

In this paper we showed that a synergistic use of multivariate
statistical techniques and of the SOM algorithm is very effective, mainly
in the case of heterogeneous data sets, given an appropriate choice of
the coding criteria for the AAs and of the length of the oligopeptides
used to represent the primary structures. This clearly appears from the
comparison of the upper and lower panels in Figure 2, referring to data
sets of high and low mean homology, respectively. Under the former
condition, as indicated by the high value of the SMH, all the explored
criteria for the coding of primary structures look almost equivalent. The
improvement obtained with reference to the more traditional symbolic
representation of AAs and dipeptide frequencies is evident in the lower
panel, where the data set includes elements of much lower SMH.

This poses the question whether a further improvement could be obtained
by further increasing the oligopeptide length, d. For both data sets used
in this work this was actually not the case (not reported). The main
reason is related to the exponential increase, with increasing d, of the
size of the frequency matrices, coupled to a linear decrease in the
number of oligopeptides associated with each primary structure of length
N described over an alphabet of n different symbols (n = 20 for an
unreduced symbolic representation of the 20 natural AAs). In other words,
the ratio

$$ \frac{N - d + 1}{n^d} \qquad (5) $$

which is an upper bound to the fraction of the non-zero elements in the
frequency matrix for each polypeptide sequence, tends very rapidly to
zero with increasing d. Thus, the sparsity of the cumulative matrix
related to the whole data set, obtained from the element by element sum
of the individual frequency matrices, should be considered as the main
factor affecting the efficiency of the SOM classifier. Bringing
expression (5) to more favourable values by reducing n, i.e. by
clustering the AA residues into relatively homogeneous groups, requires
the adoption of a numerical coding for the residues, on the basis of
their hydrophobicity (Cid 1982; Reyes 1993) or, even better, of the
principal components extracted from a set of physico-chemical properties.
The optimal number of such groups can be defined, in any case, through
the Explained Variability index (see the legend to Table 4). A
complementary approach obviously consists in an appropriate filtering of
the sparse matrices.
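
The effect of n and d on this ratio is easily appreciated with a short
numerical example. The Python sketch below evaluates it for a sequence of
104 residues, a length typical of the cytochromes of Data Set I, with
both the full alphabet (n = 20) and a reduced one (n = 5); the function
name occupancy_bound is an assumption of this example.

```python
def occupancy_bound(seq_length, n_symbols, d):
    """Ratio (N - d + 1) / n**d between the oligopeptides contributed by a
    sequence of length N and the size of its frequency matrix."""
    return (seq_length - d + 1) / n_symbols ** d


if __name__ == "__main__":
    for n in (20, 5):
        for d in (2, 3, 4):
            print(f"n = {n:2d}  d = {d}  ratio = {occupancy_bound(104, n, d):.5f}")
```

For n = 20 the ratio falls from about 0.26 with dipeptides to about 0.013
with tripeptides, whereas for n = 5 it is still about 0.82 with
tripeptides: reducing the alphabet is what keeps the frequency matrices
reasonably populated when d is increased.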

A possible objection to the strategy sketched above could invoke the
observed insensitivity to the various coding schemes in the
classification of the primary structures included in Data Set I. This
focuses our attention on the peculiar features of the elements of this
data set, namely on their structural (at the tridimensional level) and
functional homogeneity, which seems to pose an intrinsic limit to any
substantial improvement in the classification, even by increasing the
oligopeptide length. It was not possible in fact, under the explored
conditions, to outperform the traditional symbolic coding of the residues
coupled to dipeptide frequency matrices. A crucial observation in that
respect, however, is that even after randomly grouping the residues into
4 or 5 classes the quality of the classification, as judged by the MMH
index, was not decreased. This points to the conclusion that even a
relatively poor symbolic coding is able to capture the only relevant
source of information in this peculiar data set, which could be
associated with a variability of syntactic type, i.e. related to local
differences between the elements of the set, (relatively) independent of
their macroscopic function, since all of them share a common structural
and functional backbone (Yockey 1977). In the absence of such a common
backbone, as in the case of Data Set II, where the substantial
differences between the primary structures give rise to a more semantic
variability (i.e. related to macroscopic functional differences), the
numeric coding of AAs should be preferred to the symbolic one. By getting
rid of the redundant information it makes it easier, in fact, to increase
the length of the oligopeptides describing the primary structures, and
hence to reach a more accurate description of their global architecture,
with substantial savings in terms of computational requirements.

To what extent it is really worth increasing such length remains an open
question. On the basis of a symbolic coding of the AAs, Strait and Dewey
(1996) recently argued that the Conditional Information Entropy ($I_k$)
of k-tuples of AAs, used to estimate the Information Entropy ($I$) of
protein primary structures through the expression

$$ I = \lim_{k \to \infty} I_k \qquad (6) $$

already reaches a limiting value for k equal to four. Among other things,
these authors are also able to work out a figure for the fraction of the
Information Entropy related to the tridimensional structure. Thus, it
seems of great interest to check their theoretical conclusions against
the results of an empirical approach based on the performance of SOM
classifiers and a physico-chemical coding of the AAs.
Legends to tables and figures



Figure 1: Classification of cytochromes of the c type by a SOM 
algorithm. 

The map is a graphical rearrangement of the output provided by the
SOMPAK 3.1 program (see the text) on the cytochromes listed in Table 2.
The input vectors, containing 5^3 = 125 components, have been
constructed: i) grouping the AAs into 5 classes by a k-means algorithm on
the basis of the first and second principal components extracted from 7
physico-chemical properties (see the text), and ii) using the tripeptide
frequency matrices.

The hexagonal lattice of the map and its overall size (6x7 cells) are a
compromise between the conditions used by Ferran and Ferrara (1992) and
Kohonen's suggestion to use different sizes for the axes of the map. The
working parameters of the SOMPAK program are the following:

Lattice topology: hexagonal; Neighbourhood: bubble

First ordering phase:

learning rate = 0.05, 1000 epochs, starting radius 7

Fine tuning phase:

learning rate = 0.02, 10000 epochs, starting radius 2.

The map refers to the best result obtained, in terms of the internal
distortion parameter, over 40 different choices of the random initial
conditions (see the Kohonen refs. for details).



Figure 2: Performance of the SOM algorithm for proteins' 
classification under various conditions. 

Panels A and B refer to the proteins in Data Sets I and II (listed in
Table 2 and Table 3) and the histograms represent the MMH (Map Mean
Homology) score as defined in the text. The working parameters of the
SOMPAK program are the same as listed for Figure 1 except that, in the
case of Data Set II, the dimension of the maps was 5x4 due to the lower
number of elements.

The black and white bars refer to the unclustered natural AAs and to the
randomly clustered AAs, respectively. The darker and lighter grey bars
refer to clustering by hydrophobicity and by the PC1 + PC2 extracted from
physico-chemical properties (see the text), respectively. In the case of
random clustering each bar is the average of ten randomizations and the
error bars indicate one standard deviation.
Table 1: PCA on seven physico-chemical properties of the natural AAs.

The table shows the correlations (loadings) between seven
physico-chemical properties taken from Schneider and Wrede (1993) and the
principal components extracted from them. The first row reports the
percent of the total variability (EV%) of the whole set of properties
explained by each component.



Table 2: Cytochromes of the c type used as Data Set I.

Column 1 is a numeric identifier for the corresponding entry, without the
cytc prefix, in the SwissProt database (column 2). Columns 3 and 4
refer, respectively, to the biological origin and the number of residues
of each protein. The family abbreviations used are the following:
Amphibia (Am), Angiosperm (Ap), Asteroidea (As), Birds (Av), Gastropoda
(Ga), Chlorophyceae (Ch), Euglenoid algae (Eu), Ascomycetes (Fa),
Basidiomycetes (Fb), Deuteromycetes (Fd), Gymnosperm (Gp), Insects (In),
Mammals (Ma), Oligochaeta (Ol), Agnatha (Pa), Chondrichthyes (Pc),
Osteichthyes (Po), Protozoa (Pr), Reptiles (Re).



Table 3: Immunoglobulin-like fold proteins used as Data Set II.

The first four columns contain the same type of information as in Table
2. Notice that the primary structures have been obtained from the PDB
data bank in this case. The full protein names are listed in column 5.



Table 4: Clustering of the 20 natural AAs according to different
criteria.

The first two columns refer to the variable(s) upon which the clustering
into 4, 5 or 10 classes has been carried out by the k-means algorithm. In
each case the value of the percent of the explained variability (EV%) has
been calculated as the following ratio:

$$ \mathrm{EV\%} = 100 \times \frac{VarBetw}{VarBetw + VarWith} $$

where VarBetw and VarWith are, respectively, the variability between the
barycentres of the classes and the mean variability within each class.
The last column provides an example of a "random clustering" of the 20
AAs into the same number of classes.
Table 1: PCA on seven physico-chemical properties of the natural AAs.

                    PC1      PC2      PC3      PC4      PC5      PC6      PC7
EV%               50.04    34.73     7.43     5.29     1.90     0.47     0.14
Hydrophobicity    0.231   -0.940   -0.209    0.052    0.023    0.120    0.030
Volume            0.953    0.239    0.025    0.067   -0.142    0.068   -0.063
Surface Area      0.865    0.466    0.020   -0.023   -0.172   -0.012    0.067
Hydrophilicity   -0.560    0.736    0.357   -0.017    0.064    0.113    0.015
Bulkiness         0.857   -0.146    0.285    0.362    0.180   -0.028    0.007
Refractivity      0.863    0.188   -0.071   -0.423    0.192    0.006   -0.003
Polarity         -0.047    0.821   -0.512    0.229    0.096    0.016    0.003
Table 2: Cytochromes of the c type used as Data Set I.

Id   Code          Species                Fam.          Length
 1   ranca         Rana Catesb.           Am            104
 2   [illegible]   [illegible]            [illegible]   112
 3   fages         Fagopyrum Escul.       Ap            [illegible]
 4   ricco         Ricinus Comm.          Ap            107
 5   braol         Brassica Oler.         Ap            111
 6   aruma         Arum Macul.            Ap            [illegible]
 7   [illegible]   [illegible]            [illegible]   111
 8   cansa         Cann. Sativa           Ap            104
 9   abuth         Abutil. Theophr.       Ap            [illegible]
10   nigda         Nigel. Damasc.         Ap            101
11   allpo         Allium Porrum          Ap            [illegible]
12   maize         Zea Mays               Ap            [illegible]
13   [illegible]   [illegible]            [illegible]   111
14   troma         Tropaeol. Majus        Ap            109
15   passa         Pastin. Sativa         Ap            107
16   soltu         Solanum Tuber.         Ap            111
17   cucma         Cucurb. Max.           Ap            111
18   orysa         Oryza Sativa           Ap            111
19   sesin         Sesamum Indic.         Ap            108
20   gosba         Gossypium Barbad.      Ap            108
21   spiol         Spinacia Oler.         Ap            111
22   helan         Helianth. Ann.         Ap            111
23   lyces         Lycopersicon Escul.    Ap            111
24   wheat         Triticum Aestiv.       Ap            112
25   astru         Asterias Rub.          As            103
26   chick         Gallus Gallus          Av            104
27   anapl         Anas Platyrhyn.        Av            104
28   drono         Dromaius N.-Holl.      Av            104
29   strca         Struthio Camel.        Av            104
30   aptpa         Aptenodytes Patag.     Av            104
31   colli         Columba Livia          Av            104
32   helas         Helix Aspersa          Ga            98
33   chlre         Chlamydom. Reinh.      Ch            111
34   entin         Enterom. Intest.       Ch            100
35   euggr         Euglena Gracil.        Eu            102
36   schpo         Schizosac. Pombe       Fa            [illegible]
37   [illegible]   [illegible]            [illegible]   [illegible]
38   issor         Issatchen. Ori.        Fa            [illegible]
39   neucr         Neurosp. Cr.           Fa            [illegible]
40   torha         Torulasp. Hans.        Fa            [illegible]
41   ustsp         Ustilago Sphaer.       Fb            107
42   [illegible]   [illegible]            [illegible]   111
43   ginbi         Ginkgo Biloba          [illegible]   [illegible]
44   [illegible]   Samia Cynthia          In            107
45   schgr         Schistoc. Greg.        In            107
46   [illegible]   [illegible]            [illegible]   107
47   luccu         Lucilia Cupr.          In            107
48   apime         Apis Mell.             In            107
49   haeir         Haematob. Irrit.       In            107
50   macma         Macrobrac. Mal.        In            104
51   manse         Manduca Sexta          In            107
52   canfa         Canis Famil.           Ma            104
53   equas         Equus Asinus           Ma            104
54   horse         Equus Caball.          Ma            104
55   human         Homo Sapiens           Ma            104
56   minsc         Miniopt. Schreib.      Ma            104
57   macmu         Macaca Mulat.          Ma            104
58   atesp         Ateles Sp.             Ma            104
59   mirle         Mirounga Leon.         Ma            104
60   eisfo         Eisenia Foetida        Ol            108
61   enttr         Entosphen. Trident.    Pa            104
62   squsu         Squalus Sucklii        Pc            104
63   cypca         Cyprin. Carpio         Po            94
64   katpe         Katsuwon. Pelamis      Pr            103
65   crifa         Crithidia Fasc.        Pr            113
66   crion         Crithidia Oncop.       Pr            112
67   tetpy         Tetrahymena Pyr.       Pr            109
68   croat         Crotalus Atrox         Re            104
69   chese         Chelydra Serp.         Re            104
Table 3: Immunoglobulin-like fold proteins used as Data Set II.

No   PDB Id    Source                    Length   Protein's name
 1   1ACX      Actinomyces globisporus   108      Actinoxanthin
 2   1COB(A)   Bovine erythrocytes       151      Superoxide dismutase
 3   1CTM      Turnip - Brassica rapa    250      Cytochrome f
 4   1TEN      Human                     90       Tenascin
 5   3HHR(B)   Human                     197      Human growth hormone
 6   3DPA      Escherichia coli          218      Pap D
 7   2RHE      Human                     114      Bence-Jones protein
 8   2MCG(1)   Human                     216      Immunoglobulin lambda
 9   1MCO(L)   Human                     216      Immunoglobulin g1
10   1FAI(L)   Mouse                     214      Fab fragment
11   2FB4(H)   Human                     229      Immunoglobulin fab
12   8FAB(B)   Human                     215      Fab fragment
13   2FBJ(H)   Mouse                     220      Ig A fab fragment
14   1CDB      Mouse                     105      T lymphocyte adhesion glycoprotein
15   1TLK      Turkey gizzard            103      Telokin
16   1MCO(H)   Human                     428      Immunoglobulin g1
17   2IGE(A)   Human                     320      Fc fragment (theoretical model)
18   1PFC      Guinea pig serum          111      Ig g1 pFc(prime) fragment
19   1CID      Rat                       177      T-cell surface glycoprotein Cd4
20   3CD4      Human                     178      T-cell surface glycoprotein Cd4
21   1DLH(A)   Human                     180      Histocompatibility antigen Hla-dr1
22   1DLH(B)   Human                     188      Histocompatibility antigen Hla-dr1
23   3HLA(A)   Human                     270      Histocompatibility antigen Hla-a2
Table 4: Clustering of the 20 natural AAs according to different criteria.

Cluster   Hydrophobicity   PC1 + PC2   Random

4 Clusters
1         ACGILMFPSTWV     CILMPTV     R
2         DEK              RNDQEHK     NASYGH
3         NQHY             AGS         LMDFCWQE
4         R                FWY         IVKPT
EV:       94%              84%

5 Clusters
1         ACILMFWV         CILMPTV     MN
2         DEK              NDQEH       DFYPWE
3         NQH              AGS         LVAKTH
4         R                FWY         SRC
5         GPSTY            RK          IGQ
EV:       98%              90%

10 Clusters
1         ILV              ILMV        K
2         DK               ND          MDP
3         NQ               AS          Q
4         R                FY          LNG
5         ACW              RK          VT
6         PY               CPT         E
7         H                W           AF
8         GST              G           IYH
9         MF               QH          RCW
10        E                E           S
EV:       99.9%            98%
Figure 1